
TABLE 2.1
Evaluating the components of Q-ViT based on the ViT-S backbone. #Bits denotes the weight-activation bit-width (W-A); Top-1 accuracy is reported in %.

Method              #Bits   Top-1   #Bits   Top-1   #Bits   Top-1
Full-precision      32-32   79.9    -       -       -       -
Baseline            4-4     79.7    3-3     77.8    2-2     68.2
+IRM                4-4     80.2    3-3     78.2    2-2     69.9
+DGD                4-4     80.4    3-3     78.5    2-2     70.5
+IRM+DGD (Q-ViT)    4-4     80.9    3-3     79.0    2-2     72.0

2.4 Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

Drawing inspiration from the achievements in natural language processing (NLP), object detection with transformers (DETR) has emerged as a new approach that trains an end-to-end detector with a transformer encoder-decoder [31]. In contrast to earlier methods [201, 153] that heavily rely on convolutional neural networks (CNNs) and necessitate additional post-processing steps such as non-maximum suppression (NMS) and hand-designed sample selection, DETR tackles object detection as a direct set prediction problem.
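To make the set-prediction pipeline concrete, the following is a minimal sketch of a DETR-style forward pass. It is an illustrative assumption of the structure, not the reference implementation: positional encodings and the Hungarian matching loss are omitted, and all module names, shapes, and hyperparameters below are placeholders.

```python
# Minimal DETR-style forward pass (illustrative sketch, not the official code):
# a CNN backbone produces feature tokens, a transformer encoder-decoder attends
# learned object queries to them, and small heads emit class logits and boxes,
# so no NMS post-processing is needed. Positional encodings are omitted.
import torch
import torch.nn as nn
import torchvision


class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h)

    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))  # (B, d, H/32, W/32)
        tokens = feats.flatten(2).transpose(1, 2)       # (B, H/32 * W/32, d)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(tokens, queries)          # co-attended object queries
        return self.class_head(hs), self.box_head(hs).sigmoid()  # direct set prediction
```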

Despite this attractiveness, DETR usually has many parameters and floating-point operations (FLOPs). For instance, the DETR model with a ResNet-50 backbone [84] (DETR-R50) has 39.8M parameters, about 159 MB of memory usage, and 86G FLOPs. This leads to unacceptable memory and computation consumption during inference and challenges deployment on devices with limited resources.
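The storage figure follows directly from the parameter count; the short estimate below reproduces it under the assumption of 32-bit floating-point weights and shows the idealized shrinkage from low-bit formats (quantization scales, zero-points, and activation memory are ignored).

```python
# Back-of-the-envelope weight-storage estimate for DETR-R50: 39.8M parameters
# at 32 bits each occupy roughly 159 MB, and an idealized w-bit format shrinks
# this by about 32/w. Overheads such as per-channel scales are ignored.
params = 39.8e6
for bits in (32, 8, 4, 2):
    size_mb = params * bits / 8 / 1e6      # bytes -> MB (decimal)
    print(f"{bits:>2}-bit weights: {size_mb:6.1f} MB")
# 32-bit -> 159.2 MB, 8-bit -> 39.8 MB, 4-bit -> 19.9 MB, 2-bit -> 10.0 MB
```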

Therefore, substantial efforts on network compression have been made toward efficient online inference [264, 260]. Quantization, which represents a network in low-bit formats, is particularly popular for deployment on AI chips. Yet prior post-training quantization (PTQ) for DETR [161] derives the quantized parameters from pre-trained real-valued models, which often confines the model to a sub-optimal state due to the lack of fine-tuning on the training data. In particular, the performance drops drastically when quantizing to ultra-low bits (4 bits or less). Alternatively, quantization-aware training (QAT) [158, 259] performs quantization and fine-tuning on the training dataset simultaneously, leading to only trivial performance degradation even at significantly lower bit-widths. Although QAT methods have proven very effective in compressing CNNs [159, 61] for computer vision tasks, low-bit DETR remains unexplored.
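Since the baseline built in the next paragraph rests on such QAT techniques, the sketch below illustrates their core mechanism in the spirit of LSQ [61]: fake-quantize weights and activations in the forward pass with a learnable step size, and use a straight-through estimator so gradients can fine-tune both. This is an illustrative re-implementation under simplifying assumptions (per-tensor step, no gradient scaling), not the authors' code.

```python
# QAT via fake quantization with a learnable step size (LSQ-style sketch).
# Forward: clamp, round to b-bit integers, dequantize. Backward: the
# straight-through estimator passes gradients through the rounding.
import torch
import torch.nn as nn


class LSQFakeQuant(nn.Module):
    def __init__(self, bits=4, signed=True):
        super().__init__()
        n = 2 ** (bits - 1) if signed else 2 ** bits
        self.qmin, self.qmax = (-n, n - 1) if signed else (0, n - 1)
        self.step = nn.Parameter(torch.tensor(1.0))   # learnable quantization step

    def forward(self, x):
        step = self.step.abs().clamp_min(1e-8)
        q = torch.clamp(x / step, self.qmin, self.qmax)
        q = q + (q.round() - q).detach()              # straight-through estimator
        return q * step                               # dequantize for the next layer


class QuantLinear(nn.Linear):
    """Linear layer whose weights and inputs are fake-quantized during training."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.wq = LSQFakeQuant(bits, signed=True)
        self.aq = LSQFakeQuant(bits, signed=True)

    def forward(self, x):
        return nn.functional.linear(self.aq(x), self.wq(self.weight), self.bias)
```

In a QAT baseline of this kind, every linear and attention projection in the detector would be wrapped in such quantized layers and the whole network fine-tuned end-to-end on the training set.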

In this paper, we first build a low-bit DETR baseline, a straightforward solution based on common QAT techniques [61]. Through an empirical study of this baseline, we observe significant performance drops on the VOC [62] dataset. For example, a 4-bit quantized DETR-R50 using LSQ [61] achieves only 76.9% AP50, leaving a 6.4% performance gap compared with the real-valued DETR-R50. We find that the incompatibility of existing QAT methods mainly stems from the unique attention mechanism in DETR, in which spatial dependencies are first constructed between the object queries and the encoded features, and a feed-forward network then maps the co-attended object queries to box coordinates and class labels. A naive application of existing QAT methods to DETR leads to query information distortion, and therefore the performance degrades severely. Figure 2.8 shows an example of information distortion in the query features of 4-bit DETR-R50, where the distribution of the query modules in the quantized DETR deviates significantly from that of the real-valued version. The query information distortion causes inaccurate focus of the spatial attention, which can be verified by following [169] to visualize the spatial attention weight maps of 4-bit and real-valued DETR-R50 in Fig. 2.9. We can see that the quantized DETR-R50 bears